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(57) Abstract: This invention relates to methods and systems for in siUco or bioinformatic modeling of cellular metabolism. The 
invention includes methods and systems for modeling cellular metabolism of an otganism. comprising constructing a flux balance 
Q analysis model, and applying consliaints to the flux balance analysis model, the constraints selected from ihc set consisting uf: qual- 
^ itative kinetic information constraints, qualitative regulatory information constraints, and difierential DNA microarray experimental 
data constraints. In addition, the present invention provides for computational procedures for solving metabolic problems. 
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TITLE: METHOD AND SYSTEM FOR MODELING CELLULAR 
METABOLISM 

PRIORITY STATEMENT 
5 This appUcation claims priority to Provisional Patent ApplicatioirNo. 

60/260,713 filed January 10, 2001 and Provisional Patent Application No. 
60/278,535 filed March 23, 2001, both of which are herein mcorporated by 
reference in their entirety. 

10 FIELD OF THE INVENTION 

This invention relates to methods and systems for in silica or 
bioinfbrmatic modeling of cellular metabolism. More spedficaUy, although not 
exclusively, this invention relates to a fi^anework of models and methods that 
improve upon flux balance analysis (FBA) models through incorporation of 

15 particular constraints. These constraints incorporate^ without limitation, 
qualitative kinetic information, qualitative regulatory information, and/or 
DNA microarray experimental data. Further, the present invention relates to 
solving various metabolic problems using particular computational procedures. 

20 BACKGROUND OF THE INVENTION 

MetaboUc pathway engineering has attracted significant interest in 
recent years catalyzed by the rapidly increasing number of sequenced 
microbial genes. As of January 2001, over fifty microbial genomes were 
completely sequenced. Bioinformatic tools have allowed the functional 

25 assignment of 45 to 80 % of their coding regions. E. Pennisi, Science 277, 1432 
(1997). This newly acquired information is used in conjxmction with microbial 
mathematical models to calculate the response of metabolic networks after 
gene knockouts or additions. For example, guch information was used to 
increase ethanol production in inetdboUcally engineered E, coli cells. V. 

30 Hatzimanikatis, et aZ., Biotechnol. Il^oeng. 58, 164 (1998). 
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In general* mathematical models of cellular metabolism fall into two 
distinct categories, ones that incorporate kinetic and regulatory information 
and others that include only the stoichiometry of the reaction pathways. The 
first class of models matches cellular behavior at an original steady state and 

5 then employs kinetic and regulatory relations to ex amin e how the cell behaves 
away from this steady state in the presence of small perturbations brought 
about by environmental changes or enzyme engineering. The key advantage of 
this first class of methods is that upon application a unique point in the 
metabolite flux space is identified. The disadvantage is that the required 

10 kinetic parameters are difficult to estimate and their accuracy and 

reproducibility may deteriorate rapidly as the system moves far away from the 
original steady-state. 

The second dass of models, flux balance analyses, utilizes only the 
stoichiometric mass balances of the metabolic network and cellular 

15 composition information, in the absence of detailed kinetic and thermodynamic 
data, to identify boundaries for the flux distributions available to the celL 
Although micnroorganisms have evolved highly complex control structures that 
eventually collapse thede available boundaries into single points, flux balance 
models are still valuable in setting upper bounds for performance targets and 

20 in identifying 'ideal'" flux distributions. 

However, the versatility of flux balance analjrsis comes at the expense of 
unknowingly crossing kinetic or regulatory flux barriers. Flux balance model 
predictions must thus be cautiously interpreted as "ideal" flux distributions 
3^elding upper bounds to the performance of the metabohc network. The key 

25 advantage of flux balance models is that, by not requiring any numerical 

values for kinetic parameters or regulatory loops, they are straightforward to 
compile. The key disadvantage is that the obtained stoichiometric boundaries 
can be very wide and it is hard to envision that the biomass maximization 
conjecture, while useful ujider certain conditions, is generally applicable. 

30 It is therefore a primary object of the present invention to provide a 

method and system that improves upon the state of the art. 
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It is a further object of the present inventioii to provide a method and 
system that provides a framework for improving upon flux balance analysis 
models. 

It is a still further object of the present invention to provide a method 
5 and system that allows the predictive capabilities of flux balance analysis 
models to be enhanced 

Another object of the present invention is to provide a method and 
system that incorporates qualitative kinetic and/or regulatory information into 
a flux balance analysis modeL 
10 Yet another object of the present invention is to provide a method and 

i^tem that incorporates differential DNA. microarray experimental data into 
a flux balance analysis model. 

A further object of the present invention is to provide an unproved 
method and system for determining minimal reaction sets for growth. 
15 Another object of the present invention is to provide an improved 

method and system &r determining the eSect of environmental conditions on 
fiiiniTTiflT reaction sets. 

It is another object of the present invention to provide a method for 
<*ft|<>iiTfltiTig the response of metabolic networks after gene knockouts or 
20 additions. 

A still further object of the present invention is to provide a method and 
system for selecting mathematically optimal genes for recombmation. 

Another object of the present invention is to provide a method and 
system for identifying lethal gene deletions. 
25 Yet another object of the present invention is to provide a method and 

system for identifying gene therapeutic candidates for pathogenic microbes. 

A still further object of the present invention is to provide a method and 
system capable of testing hypotheses or objective functions. 

These and other objects, features and/or advantages of Uie present 
30 invention will become apparent from the specification and claims. 
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SUMMARY OF THE INVENTION 

This inveation indudes a firamework for in sUico or bioinfonnatic 
modeling of cellular metabolism. The jframework allows for an improvement to 
FBA models through incorporation of particular constraints. Preferably, these 

5 constraints are logic constraints that can be represented with binary variables. 
The framework provides for applying computational procedures in order to 
solve for model predictions. The model can be used to determine: how many 
and which foreign genes should be recombined into an existing metabolic 
network; which regulatory loops should be activated or inactivated so that a 

10 given metabolic target is optimizec^ how robust is a metaboHc network to gene 
deletion; what is the mathematically minimal set of genes capable of meeting 
certain growth demands for a given uptake environment; whether 
experimental flux data, under different substrates and carbon/o^gen uptake 
rates, are consistent with different hypothesized objective functions; and other 

15 metabolic problems. The restdts obtained from use of this framework can be 
applied in a niunber of areas of research or commercial interest related to 
metabolic engineering, including areas in the biological, chemical, 
pharmaceutical, life sciences, and medical fields. 

20 BRIEF DESCRIPTION OF THE DRAWINGS 

Fi^^ure 1 is a block diagram showing an overview of the present 
invention. 

Figure 2 is a diagram of multiple objective function slopes consistent 
with the same optimimi point. 
25 Figure 3 is a set of feasible objectives for different conditions. 

Figure 4 is a pictorial representation of stoichiometric boxmdaries, 
kinetic/regulatory barriers and a new optimal steady state. 

Figure 5 is a diagram of a simple network showing the application of 
logic constraints. 

30 Figure 6 is a diagram of two parts of a metaboUc network where 

bottlenecks are identified. 
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Figure 7 is a logarithmic plot of probability of flux/transcript ratio 
agreement versus transcript ratio. 

Figure 8 is a plot of miniimiTn acetate uptake rate versus a for a 0.3 hr^ 
growth rate. 

5 Figure 9 is a table of model predictions for maximum theoretical yields 

of seven amino adds for growth on glucose and acetate. 

Figure 10 is a diagram showing the pathway modifications introduced in 
a recombined network for growth on glucose. Figure 10 shows the difference 
between optimal E.coli and Universal arginine production pathways for 
10 growth on glucose, including (a) the pjrrophosphate dependent analog of 6- 
phosphofiructokinase in the Universal model replacing the ATP dependent 
version present in E. coli; and (b) carbamate kinase in the Universal model 
replacing carbamoyl phosphate synthetase from the E, coli network- 
Figure 11 is a graph showing the size of minimal reaction networks as a 
15 function of imposed growth rate for (a) growth on only glucose and (b) growth 
on a medium allowing for the uptake of any organic compound with a 
corresponding transport reaction. 

Figure 12 is a table showing modifications to the Pramanik and 
Eeasling modeL 

20 Figure 13 is a graph showing gene knockouts at various biomass 

production levels for growth on glucose. 

Figure 14 is a table showing genes selected for removal by knockout 

study. 

Figure 16 is a table showing model selections of enzymatic reactions 
25 that will enhance the amino add production capabilities of £L coli. 

Figure 16 illustrates optimal E. coli and Universal arginine production 
pathways for growth on glucose. The utilization of carbamate kinase and the 
pyrophosphate dependent analog of 6-phosphofiructokinase by the Universal 
arginine production pathway preserves a net of 3 ATP phosphoanhydride . 
30 bonds. 
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Figure 17 illustrates optimal E. coli and Universal axginine production 
pathways for growth on acetate. The incorporation of carbamate kinase and 
the pyrophosphate dependent analog of acetate kinase by the Universal 
pathway saves 3 ATP phosphoanhydride bonds. 
5 Figure 18 illustrates optimal asparagine production pathways for two 

modes of glucose utilization: glucokinase and the phosphotransferase system. 

Figure 19 illustrates an optimal Universal asparagine production 
pathway for growth on glucose, the Universal pathway conserves the 
equivalent of 1 ATP bond by using an ADP-forming aspartate-ammonia ligase 
10 instead of an AMP-£6rming version as shown in the previous figure. 

Figure 20 illustrates optimal E, coli and Universal histidine production 
pathways for growth on acetate. Both the energy efficiency (2 ATFs) and 
carbon conversion effidenQr of the Universal pathway are improved by the 
incorporation of a pyrophosphate dependent analog of PEP carboxykinase and 
15 glycine dehydrogenase, respectively. 

Figure 21 is a graph of a number of reactions in each minimal set as a 
function of the imposed growth demands for a glucose or acetate-only uptake 
environment. 

Figure 22 is a table showing evolution of minimal reaction sets under 
20 decreasing growth conditions. 

Figure 23 is a table showing metabolites uptaken or secreted at each 
target growth rate on an optimally engineered mediimi. 

Figures 24 and 25 are graphs of a number of reactions in each minimal 
set as a function of the imposed growth demands for an uptake environments 
25 allowing multiple organic uptakes. 

figure 26 is a table showing evolution of minimal reaction sets for a 
second set under decreasing growth requirements. 

Figure 27 is a table showing functional classification of minimal 
network reactions for growth on an optimally engineered medium. 
30 Figure 28 is a table showing a comparison of minimal metabolic 

gene/reaction sets based on functional classification. 

6 
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nTCTATT.TCD DESCRI PTION OF THE INVENTION 
1. OVERVIEW 

Figure 1 illustrates the framework of the present invention- This 

5 framework improves upon flux balance analysis (FBA) models through 

incorporation of particular constraints. These constraints incorporate, without 
limitation, qualitative kinetic information, quahtative regulatory information, 
and/or DNA microarray experimental data. Preferably, these constraints are 
logic constraints that can be represented with binary variables* The invention 

10 also provides for including computation procedures such as mixed-integer 

linear programming into the framework in order to use the model to arrive at a 
solution. As shown in Figure 1, the model provides for determining metabolic 
performance/robustness in the face of gene additions or deletions. In addition 
the model provides for testing whether experimental flux data, under different 

15 substrates and carbon/oxygen uptake rates are consistent with different 
hypothesized objective functions. 

The present invention involves a process for tightening the flux 
boundaries derived through flux balance models and subsequently probing the 
performance limits of metabolic networks in the presence of gene additions or 

20 deletions. Given the lai^ number of genes Oiuiidreds to thousands) available 
for recombination, present optimization ^formulations reach and sometimes 
exceed the limit of what can be solved with state of the art mixed-integer 
linear programming solvers. The present invention meets the dxial objectives 
of constructing modeling formulations that enable an effective query of the 

25 performance limits of metabohc networks and provide customized techniques 
for solving the resulting mixed-integer Unear programming problems. 



2. OBJECTIVE FUNCTION HYPOTHESIS TESTING 

The present invention provides for an unbiased, mathematically 
30 rigorous framework for testing whether experimental flux data, under 
different substrates and carbon/oxygen uptake rates, are consistent with 

7 
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different hypothesized objective functioiis. A. Varma and B. O. Palsson, 
Bio/Technology 12, 994 (1994); R. A. Majewski and M. M. Domach, Biotechnol. 
Bioeng. 35, 732 (1990). Rather than starting by postulating such an objective 
function, or even accepting that there exists an objective function governing 

5 cellular behavior, the quantitative framework of the present invention is based 
on inverse optimization that enables researchers to test, disprove or fine tune 
the consistency of different hypotheses. Note that while one can never prove 
the existence of such an objective function, the framework is useful for 
rigorously testing whether experimental data is consistent or inconsistent with 

10 a postulated objective function and how this may change under different 
enviromnental conditions. 

Inverse optimization concepts that were pioneered in geophysics for the 
identification of model parameters for systems reaching optimality given a set 
of observables are applied here. Specifically, the present invention provides for 

15 finHing the coefficients cj in a hypothesized linear objective function ^ j cjvj 

that are consistent with the subset of observed fluxes v) (e.g.> 
substrate/oxygen uptakes, growth rate, etc.). In general, not sii^ but rather 
a range of values for the coef&cients cj are consistent with a set of observed 
fluxes. This is illustrated with Figure 2A. in two dimensions. 

20 Any objective function cm + csos whose slope (-ca/ci) is between values 

a and h is consistent with the optimality of point A This gives rise to the 
range of values for ci and C2 denoted by the line segment between points B and 
C shown in Figure 2B that are consistent with the optimality of point A. Note 
that cx and cs were scaled so that ci + cj? = 1. In the general n dimensional case, 

25 the set of cj values in compliance with an optimum v*j forms a polytope. 

The general problem is addressed using the ideas introduced by Ahuja 
and Orlin (2001). R. K Ahuja, et aZ., Network Flows. Theorv. Algorithms, and 
A pplications. Prentice Hall, Englewood Cliffs, N.J. 1993. Given an observed 
subset of fluxes v) the set of objective function coefficients cj can be 

30 determined by finHmg all multiple optimal solutions of the restricted dual 

8 
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feasibility problem solved in the space of dual variables ai and the hnear 
objective function coefficients cj. 

The dual variables ai quantify the relative importance of a metabolite i 
towards improving the objective function. The solution of the restricted dual 
5 problem systematically characterizes the set of all possible cj values consistent 
with a subset of observed fluxes v] . These alternate optimal solutions can be 

obtained as a byproduct of the simplex method since any basic feasible solution 
from the simplex tableau defines a vertex of the polytope formed in the cj 
space. An alternate method using integer cuts can also be employed. S. I^e, et 

10 al, Comput, Chem. Eng. 24, 711 (2000), The present invention contemplates 
that with these techniques a determination can be made as to whether the 
polytopes overlap considerably (see Figure 3A) or migrate systematically (see 
F^;ure 3B) as the as the substrate choice or uptake rate of carbon/oxygen 
changes* This set of quantitative tools provides an imbiased framework for 

15 researchers to test the range of vahdity (if any) of diJQEerent hypotheses. 

3. KINETIC/REGULATORY LOGIC CONSTRAINTS 

flux balance models, by relying solely on stoichiometric balances and 
uptake rates, are guaranteed not to exclude any feasible flux distributions. 

20 However, this versatility may lead to overly optimistic expectations if the 

results are not interpreted properly. The flux distributions within the cell are 
ultimately uniquely determined by the regulatory mechanisms within the ceU, 
the kinetic characteristics of cellular en27mes, and the expression of these 
enzymes. Assuming cells operate in a stoichiometrically optimal fiBLshion may 

25 yield metabolic flux distributions not available to the celL The present 
invention provides for multiple methods for tightening the predicted 
stoichiometric flux boundaries by FBA models. A first strategy involves 
attempting to ensure that flux changes identified through FBA are consistent, 
in a qualitative sense, with the kinetics and regulatory loops of the metabolic 

30 network. By xmcovering unreachable domains within the stoichiometric flux 

boundaries the predictive capabilities are improved. A second strategy entails 

9 
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incorporating experimentally obtained data into the FBA model. The present 
invention includes a mathematically sound framework for superimposing DNA 
array differential egression data into FBA models. 

5 3.1 Kinetic and Regulatory Loop Consistency 

The key question addressed here is whether the optimal flux 
distributions predicted by the FBA models are reachable by the ceU or whether 
kinetic and/or regulatory boimdaries will prohibit the system from reaching 
the stoichiometric boundaries (see Figure 4). 

10 The key idea we propose to e^lore is to ensure, by using logic relations, 

that when, in response to environmental changes, the metabolic network shifts 
from one steady-state to another, up or down changes in metaboUte 
concentrations are consistent with up or down changes in reaction fluxes. 

Specifically (see Figure 5), flux i; can increase, in the absence of enzyme 

15 engineering, only if the concentration Ca of reactant A or the concentration CD 
of activator D increase or the concentration Ce of inhibitor E decreases. 
CJlearly, changes in the reaction fluxes and metabolite concentrations are 
coupled and even in the absence of detailed quantitative kinetic/regulatory 
information binding relations can be derived based on the direction of these 

20 changes. One such set of relations is described in detail below. 

SpedfiLcally, for any reaction flux i;; to increase above an initial base case 
value , either the concentration of a reactant must increase, or the 

concentration of an activator must increase, or the concentration of an 
inhibitor must decrease and vice versa. Incorporating these logic constraints 
25 into the FBA framework, requires first a regulation matrix F to be established 
describing the effect of metabolite i on reaction/. 

!1 if metabolite j activates reaction j 
- 1 if metabolite i inhibits reaction j 
0 if metabolitei has no effect on reaction j 

Such a regulation matrix can be constructed based on information from 
the EcoCyc and MetaCyc databases. P. D. Karp, et a{.. Nucleic Adds Res. 28, 
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55 (2000). Additional database resources exist also for non-E. coli reactions. 
M. Kanehisa and S. Goto, S., Nucleic Acids Res. 28, 29 (2000). Two sets of 0-1 
variables x* and zj are introduced to track up or down movements in metabolite 
concentrations and reaction fluxes respectively. 

' the concentration of metabolite i rises 
otherwise 



J 1 if 

^'=1 0 01 
^^=1 0 < 



if reaction flux 7 mcreases above origmal steady-state value 
otherwise 



By utilizing these 0-1 variables, we incorporate the following logic 
constraints into the FBA model for safeguarding against the violation of some 
of the kinetic and regulatory barriers. 

-a-^7)vr +^r^; 

10 Z^,+Z^/+E(i~^i)^^7. (2) 

tSg<0 U^g^l 

E(i-^')+E*«(i-'')+ E*/ ^^-H V/ (3) 

tSg<0 hFf^l iiFy—l 

Relation (1) ensures that > when0j= 1 aswellas uj <i;5 when;25 = 
0. Constraint (2) ensures that the concentration of a reactant must increase, 
the concentration of an activator must increase, or tiie concentration of an 
inhibitor must decrease for a reaction flux vj to increase above its uiitial base 
IS case value u^. Thelastconstraint (3) ensures that the concentration of a 

reactant must decrease, the concentration of an activator must decrease, or the 
concentration of an inhibitor must increase for a reaction flux vj to decrease 
below the initial base case value. Revisiting the example of figure 3 
constraints (?) and (3) for flux t; yield 
20 x^+X|,+(l-x^)^2„and (l-x^)+(l-^i>)+x^ ^l-^i 

Preliminary work on the alanine overproduction pathway for growth on 
glucose identified kinetic and regulatory bottlenecks that were not detectable 
by simple FBA models. 

The first step in this analysis was to obtain the initial base case values 
25 for the reaction fluxes. These were obtained by solving the LP problem for 

11 
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maxiinum biomass formation. The second step was to solve a second LP 
problem constramin^ ihe biomass production to 80% of its optimal value and 
allowing for the overproduction of alanine. The third step involved resolving 
the second step scenario with the incorporation of the kinetic and regulatory 

5 logic constraints described above. This study revealed that the overproduction 
of alanine (2.688 mmol/10 nimol GLC) subject to regulation is about 20% less 
than the value predicted by the FBA model (3.298 mmoiyiO mmol GLC) 
without the logic based regulatory constraints. More important than being 
able to identify this reduction is the capability to pinpoint spedfLc flux 

10 bottlenecks. Analysis of the reaction fluxes revealed two potential bottienecks 
limiting the performance of the network (see Figure 6). 

The first bottleneck (E^. 6A) arises because in addition to the pentose 
phosphate pathway reactions^ ribulose-5-phosphate (RL5P) is also a preciirsor 
to Ijrposaccharide (LPS) which is a component of biomass. Under less than 

15 optimal growth demands, the reaction flux from RL5P to biomass must 
decrease below its base case value. Thus the concentration of IIL5P must 
decrease (only regulator). Therefore, the flux through ribulose phoi^hate 3- 
epimerase caimot increase above its base case value because the concentration 
of the reactant KL5P is decreasing. This diverts additional flux through the 

20 ribose-S-phosphate isomerase reaction. The second bottieneck (Fig. 6B) occurs 
because during alanine overproduction, more flux must pass through pyruvate 
kinase than under maximum growth conditions. In this study, at the base 
case, the FBA model chose pyruvate kinase 11 which is one of the two 
isoenzymes of pyruvate kinase. However, the flux through pyruvate kinase n 

25 cannot increase above its base case value because the concentration of both its 
activator (AMP) and its reactants are decreasing. The FBA model including 
regulation partially circumvented this barrier by increasing the flux through 
pyruvate kinase I since the concentration of an activator (FDP) of this reaction 
is increasing. This example suggests that the logic constraints, by capturing 

30 some kinetic and regulatory information, are capable of identifying at least 
some of the bottlenecks imdetectable by simple FBA models without excluding 
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any feasible flux distributions. Identij^ing these key fluxes as described above 
and then engineering the enzymes and regulation around them provides a 
straightforward debottlenecking strategy. The present invention contemplates 
that one skilled in the art and having the benefit of this disclosure can 
5 construct additional logic constraints in the spirit of the ones described above 
to further "tighten" the predictions of flux balance models. 

4. DIFFERENTIAL DMA MICROARRAY CONSTRAINTS 

In addition to using qxiaUtative kinetic and/or qualitative regulatory 

10 information to define logic constraints for enhancing the predictive capabilities 
of flux balance models, tiie present invention provides for defining constraints 
based on experimental differential DNAmicroarray data. The recent 
development of DNA microarray technology has started to revolutionize the 
invest^tion of cellular global regulation on the whole genome scale. DNA 

15 microarrays enable the determination of differential transcription profiles, 
consisting of the relative expression levels of individual genes under various 
e^erimental conditions. This allows one to infer which genes are up- 
regulated or down-regulated as an organism responds to external 
environmental changes. Already such studies have been initiated for S. 

20 cerevisiae Qa. Wodicka, et oL, Nat BiotechnoL 15(1SX 1359 (1997)) and E. coli. 
G. S. Richmond, et al.. Nucleic Adds Res. 27(19), 3821 (1999). The output of 
such experiments is typically a set of gene transcript levels normalized with 
respect to an original steady-state. For example, the di£Eerential transcript 
levels of 111 genes, involved in central metabolism and key biosyntheses, have 

25 been measured for an E. coli strain grown on either a glycerol or acetate 
medium relative to a glucose reference condition. M. K Oh & J. C. liao, 
BiotechnoL Prog. 16(2), 278 (2000). Thus, a transcript level of 1.5 for a gene in 
the E. coli strain grown on acetate indicates that this gene is up-regulated by 
50% during growth on acetate as compared to growth on glucose. Although 

30 this methodology cannot detect any translational or post-translational genetic 
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regulation, with a few exceptions, the transcriptional regulation is the main 
mode of regulation at least in E, colL 

The key challenge is that at present transcript levels cannot be used to 
infer quantitative changes in the corresponding flux levels. Instead, at best 

5 only a qualitative statistical correlation between changes in fluxes and 

transcript levels can be drawn. Based on a quahtative linking between fluxes 
and transcript levels, the present invention uses 0-1 variables to captvire these 
trends. Let denote the normalized transcript level of gene coding for 
enz3rme catalyzing reaction j upon the environmental change L A value 

10 greater than one implies overexpression while a value less than one denotes 
underexpression. For the sake of simplicily of presentation, a one to one 
mapping of genes to reactions j ^kia assumed here. This can be easily 
relaxed if necessary. Consider binary variable wj defined as 

l,if flie transcript and flux level changes are in tiie same direction 
^ 0, otherwise 

15 Given the definition of binary variables wj, we can then write 

v;^v;-(i-iv;>;,ifr;>i (4) 

v'j^v'j-^i^-w'jlv^-'v'j]ifT}<l (5) 
where v) is the base flux level and ^ maximum allowable value. For wj = 
1, these two constraints correctly en&rceu'y 5: if > 1 and ^ ifTy 
<1 respectively. For 115 = 0 the two constraints yield obviously valid, non- 
20 binding constraints v'j tO and v'j u~ respectively. Perfect correlation 
between transcript and flux levels would have implied that all wj are equal to 
one. However, experimental studies have demonstrated that not 100% but 
rather on average about 80% of the genes exhibit transcript and flux levels 
f^liflnging in the same direction. Moreover, the fiurther firom unity the value of 
25 the transcript level is, the more likely it is for it to agree with the flux change 
directiocL This motivates a probabilistic description for quantigdng the 
likelihood that transcript changes translate to corresponding flux level 
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changes in the same direction, SpeciiBcally, we will construct a statistical 
model of the form, 



:^|forr^>l, P,.=l-|exi|- 



,forr^<l 



The scale 7^"''^ is chosen so as to control the range over which Pj " 
5 remains away from one. A value of 0.622 impUes that 100% overexpression (T) 
= 2) or underexpression (Tj = V2) confers a 90% prohabihty of vmidirectionality. 
Figure 7 plots the probability Pj of having unidirectional transcript and flux 
changes as a function of transcript level 2). Note that for 3) = 1, Pj = 0.5 
reflecting an equal chance for either outcome whereas when 2) has very large 
10 positive or negative exponents I) approaches one. The present invention 
contemplates that more elaborate models for can be used, including those 
constructed by borrowing from mechanistic methods or otiher methods 
developed for linking transcript ratios to flux changes. 

After using the Pj probabihties to weigh the effect of each wj the 
15 following constraint is obtained: 



(6) 



Here a is the fraction of genes j expected to have unidirectional 
transcrQ>t and flux changes. Thus, a quantifies the "agreeabiliiy" between the 
transcript ratios and the flux changes predicted by EBA. Augmenting FBA 

20 with constraints (4), (5) and (6) si:verimpose8 in a probabilistic sense the 
quahtative information encoded in the gene expression profiles of DNA 
microarray experiments. The above described probabilistic framework in two 
ways is employed in a number of different ways. 

The optimization of FBA models typically yields numerous alternate 

25 optima. An elegant algorithm has been proposed for identi^ing all of them by 
Lee et al, 2000. The DNA microarray data can be used to identify the subset 
of alternate optima that are consistent with the experimentally determined 
genetic expression levels. Specifically, the parameter a can be used to rank 
the multiple optima (Lee et al., 2000) typically obtained after the optimization 
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of an objective function within an FBA model with lespect to their agreeability 
with the DNA array data. Results with the data from Oh & liao (2000) for the 
transition from growth on glucose to growth on acetate show that a can vary 
from 0.74 to 0.89 for the imposed growth rate of 0.3 hr^ depending on which 

5 alternate optimal solution is identified by the solver. Thus the FBA to 

expression profile agreeability can be improved as much as 20% by maximizing 
a for a given FBA optimal solution. The present invention also provides for 
the direct incorporation of the DNA ixiicroarray data iuto the FBA model. Here 
the sensitivity of the FBA objective to the imposed agreeability with the 

10 experimental transcript profiles can be adjusted by constraining the model to 
meet various values of a. Results, shown in Figure 8, for the same data from 
Oh & liao (2000) show a quadratic trend between the miniinum acetate 
uptake rate and the imposed agreeabiUty parameter, a, 

15 5. IDENTIFYING GENE CANDIDATES FOR RECOMBINATION 

The explosive growth of annotated genes associated with metabolism 
calls for a systematic procedure for determining the most promising 
reoombination^choices. Until now» recombinant DNA technology has been used 
to add straightforward conversion pathways which introduce new and 

20 desirable cellular functions. Here the objective is to utilize flux balance 
analysis and mixed-integer programming tools to select the mathematically 
optimal genes for recombination into £. coJi or otibier piokaryotes from a 
metabolic database encompassing many genes fix>m multiple spedes. The 
resulting pathways need not lie directly on main production pathways, as they 

25 may enhance production indirectly by either redirecting metabolic fluxes into 
the production pathways or by increasing the energy ef&ciency of the present 
pathways. 

A comprehensive stoichiometric matrix containing all known metabolic 
reactions firom the Kyoto Encyclopedia of Genes and Genomes (KEGG) 
30 ^Canehisa and Goto, 2000) and Ecopyc (Karp et al,, 2000), and other sources 
can be compiled and incorporated into the flux balance model of the model 
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organism (e.g., E, coli). We refer to this multi-spedes stoichiometric matrix as 
the Universal stoichiometric matrix. This multi-spedes stoichiometric matrix 
is a valuable resource for exploring in silica gene recombination alternatives 
and examining which prokaryote will be the most advantageous choice for a 
given bioprocessing application. 

Selecting up to 7i new genes to recombine into the host organism so that 
a metabohc objective v* is maximized can be formulated as an MILP problem. 
This is accomplished by augmenting the LP flux balance model with constraint 
yk = l,y/keE that ensures that all E. coli genes are present as well as 
constraints 

keNS \k J 

that allow up to foreign genes to be incorporated in E, coli out of the 
comprehensive list contained in the Universal matrix (Le.. NE). Here the host 
organism is assimied to be E. coli but in general any annotated prokaryotic 
microbe can be selected as the host organism. Reactions chosen by the model 
but absent in E. coli (Le., all non-zero yk elements of 2VE provide routes for 
maxiipulating the cellular metabolism through recombinant DNA technology. 

Preliminary results using the flux balance £. coli model of Pramanik 
and KfiAftling demonstrate that improvements to seven amino add production 
pathways of E. coli are theoretically attainable with the addition of genes from 
foreign organisms (see the table of Figure 9). J. Praminik and J. D. Eeasling, 
Biotech. Bioeng. 56, 398 (1997). 

In most cases, only one or two genes were added to tiiie original amino 
acid production pathway even though the complete list of 3,400 reactions was 
available for selection. The mechanism of all identified enhancements is either 
by: (i) improving the energy effidency and/or (ii) increasing the carbon 
conversion efficiency of the production route. Manipulation of the arginine 
pathway showed the most promise with 8.75% and 9.05% improvements for 
growth on glucose and acetate, respectively. Figure 10 shows the pathway 
modifications introduced in the recombined network for growth on glucose. 
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Overall, the additional genes used by the Universal model save the original 
pathway three net ATP bonds increasing arginine production by 8,75%. 
Similar trends are revealed when other native and Universal amino acid 
production routes for glucose and acetate substrates are examined. 

The models of the present invention that have been described can-also 
be extended to encompass more gene candidates for recombination as they 
become available through ongoing genome projects. The present invention 
applies to any number of organisms, including other microorganisms of 
industrial significance. Even though coli is one of the most industrially 
significant microorganisms, other microbes confer advantages due to their 
relaxed regulatory mechanisms. For example, various species of the genera 
Corymbacterium and Brevibacterium have been employed to produce 
glutamate by e^loiting a phospholipid-deficient cytoplasmic membrane 
enabling the secretion of glutamate into the medimn. Riboflavin, or vitamin 
B2, overproduces include Eremothecium ashbyii Qjni Ashbya gassypii in which 
no repressive efGects from ferrous ion are observed. 

The logic based constraints of the present invention can be integrated 
with the gene selection MILP £3rmxilation to tighten the obtained predictions. 
By contrasting the optimal recombination changes identified for the production 
of di£Eerent amino adds, recombination strategies that point towards 
simultaneous yield improvements of multiple amino adds are identified. The 
invention's optimization framework for guiding gene additions provides the 
quantitative means to study flux enhancements throTi^ £Dreign gene 
recombination firom an ever-expanding database of available genes. Although 
complete gene-enzjncne relationships are not currentiy known, the formulation 
allows the incorporation of this information as it becomes available. 



6. GENE DELETIONS (MINIMAL SETS) 

The recent explosion of fully sequenced genomes has brought significant 
attention to the question of how many genes are necessary for sustaining 
cellular life. A minimal genome is generally defined as the smallest set of 
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genes that allows for replication and growth in a particular environment. 
Attempts to uncover this minimal gene set have included both experimental 
and theoretical approaches. Theoretical methods are based on the hypothesis 
that genes conserved across large evolutionary boundaries are vital to cellular 

5 survival. Based on this hypothesis, a minimal set of 256 genes was combed 
by assuming that genes common to both M genitalium and Haemophilus 
influenzas must be members of a minimal genome. A. R. Mushegian and E. V. 
Koonin, P. Natl. Acad. Sci. USA 93, 1026 (1996). 

Interestingly, however, only 6 out of 26 fi. coli open reading frames of 

10 unknown function conserved in M. genitalium were deemed essential to 
species survival F. Arigoi, et al, Nat. BiotechnoL 16, 851 (1998). The 
existence of multiple, quite different, species and environment specific minimal 
genomes has loi^ been speculated M. Husmen, M., Trends Genet. 16, 116 
(2000). The present invention provides for a computational procedure for 

15 testing this daim by estimating the minimum life-sustaining core of metabolic 
reactions required for given growth rates under diifferent uptake conditions. 
This problem can be formxdated as the following optimization problem 

M M 

min^^j^i subject to 2] SgV^=fc„ i=l,...,^ 
wilh andO^v^ ivj^yj 

tiiat solves for the smallest set of metabolic reactions that satisfies the 
20 stoichiometric constraints and meets a biomass target production rate v^^l^ . 
Alternatively, instead of a biomass target, minimum levels of ATP production 
or lowest allowable levels of key components/metabolites could be incorporated 
in the model One novel feature of this aspect of the invention is that whereas 
previous attempts utilized reductionist methodologies to extract the set of 
25 essential genes through a series of gene knock-outs, here we simultaneously 
assess the efl»act of all reactions on biomass production and select the minimal 
set that meets a given growth rate target (whole-system approach). A minim al 
gene set can then be inferred by mapping the en2yme(s) catalyzing these 
reactions to the corresponding coding genes. 
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Results based on the E. coli FBA model of Edwards and Palsson for the 
first time quantitatively demonstrated that minimal reaction sets and thus 
corresponding minimal gene sets are strongly dependent on the uptake 
opportimities afforded by the growth mediimi and the imposed growth 

5 requirements. J. S. Edwards and B. 0. Palsson, Proc Natl. Acad. Sci. USA 97, 
5528 (2000). Specifically, the minimal reaction network (subset of only E. coli 
reactions), was explored for different growth requirements under two 
contrasting uptake environments (a) restricting the uptake of organic material 
to glucose only and (b) allowing the uptake of any organic metabolite with a 

10 corresponding transport reaction. These two extreme uptake scenarios were 
chosen to model mflyinmnn and TniiiiTmim reliance on internal metaboUsm for 
component synthesis and probe its e£G^t on the minimum reaction set 
required. The TwiniimiiTn nxunber of metabolic reactions as a function of the 
imposed biomass growth target, (as a 96 of the theoretically maximum), for the 

15 two uptake choices is shown in F^^ure 11. 

While it is predicted that an E. coli cell grown on a meditmi containing 
only glucose requires at least 226 metaboUc reactions to support growth, a cell 
cultured on a rich optimally engineered medium could support growth with as 
few as 124 metaboUc reactions. As expected, the minimal reaction set becomes 

20 larger by increasing the required growth rate. However, the magnitude of this 
increase is quite different for the two cases. In case (a) the minimal reaction 
set increases only icom 226 to 236 to meet the maximum growth rate, however, 
in case (b) the minimal reaction set almost doubles going firom 124 to 203. 
Furthermore, neither the minimal reaction sets nor their corresponding 

25 reaction fluxes were found to be imique. Even after excluding cycles and 
isoenzymes hundreds of multiple minimal sets were identified providing a 
computational confirmation of the astounding redundancy and flux redirection 
versatility of the E, coli network. More importantly for case (a), all minimal 
reactions sets identified included 11 out of 12 reactions whose corresponding 

30 gene deletions were determined experimentally to be lethal for growth on 

glucose. Earlier analyses (Edwards and Palsson, 2000) based on a single gene 



wo 02/055995 



PCT/US02/00660 



deletions conducted with this model using LP optinuzation were able to 
identify only 7 out of 12 lethal gene deletions motivating the importance of 
considering simultaneous gene deletions within an MTTjP framework. 

The present invention contemplates that this framework can be built on 

5 by constructing different minimal reaction sets for not just E. coli but other 
species separated by wide evolutionary boundaries. By contrasting the 
obtained minimal sets, a comparison of minimal reaction sets (metabolic gene 
sets) along di£Eerent evolutionary branches can be made. For example, 
organisms such as M genitalium and H. influenza can be used with results 

10 benchmarked against earUer studies (Mushegian and Eoonin, 1996). By 
lumping reactions occurring in many difEerent species within the Universal 
stoichiometric matrix described earlier a species independent minimal 
metabohc reaction set can also be constructed. The piedicted E. coli based 
metabolic mioimal set of 124 reactions/genes is comparable to the 94 metabolic 

15 genes included in the minimal gene set proposed by Mushegian and Eoonin 
(1996). The present invention contemplates that this prediction gap can be 
reduced by (i) identifying more ef&dent reaction combinations, including those 
occurring in non-S. coK species, and (ii) by uncovering genes that are involved 
in the uptake or secretion of multiple (similar) metabolites reducing the total 

20 count. CSlearly, the proposed computational framework is dependent upon a 
reaction-based analysis which inherently cannot account for genes associated 
with translation, replication, recombination, repair, transcription, cellular 
structure and genes of unknown function. However, it does afford the 
versatility to study different uptake/secretion environments as well as to 

25 encompass reaction sets from multiple species in the search for the metabolic 
minimal genome providing valuable insight and perspective to the questions of 
what is the minimal genome and how is it shaped by the environment. As 
more elaborate models are developed describing elementary functions of 
minimal cells, such as the work of Browning and Shuler for the initiation of 

30 DNA rephcation, more detail will be added to the modeling framework. S. T. 
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Browning and M. L. Shuler, AICHE Annual Meeting, Session 69, Session 69, 

Los Angeles (2000). 

Apart from developing a framework for rationally identifying "minimal" 

metabolic networks we also intend to exploit the capability of predicting in 
5 silico lethal gene deletions £Dr different organisms and uptake environments. 

By identifying lethal gene deletions for pathogenic microbes as a function of 

the environment (e.g., H. pylori) a ranked list of promising targets for 

therapeutic intervention (i.e.. interruption of gene expression) can be compiled. 

This list can further be refined by imposing constraints ensuring that human 
10 metabolism do not adversely be affected by repressing the expression of any of 

the pathogen genes included in the hst. 

7. MIXED-INTEGER LINEAR PROGRAMMING SOLUTION 
TECHNIQUES 

IS The modeling framework of the present invention further provides for 

computational procedures to be used to solve the network problems presented. 
The computational procedures to be used include mixed-integer linear 
programming techniques. 

The algorithmic frameworks of the present invention in the context of 

20 gene addition^ regulation, DNA array data superposition, genetic circuit 

elucidation and minimal reaction set identification inherently require the use 
of discrete optimization variables that give rise to MILP problems. Unlike LP 
problems which can be routinely solved even for hundreds of thousands of 
variables by employing conunerdal solvers (e.g., OSL, GPLEX, UNDO, etc.) 

25 with minimal or no user intervention, MILP problems are much more 

computationally challenging typically requiring not just more CPU time but 
also user intervention. Specifically, it is typically necessary to (i) cast the 
problem in a form that is more amenable to MILP solution techniques, and (ii) 
if the problem is still intractable for commercial solvers, to construct 

30 customized solution methodologies. 
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The key source of complexity in MILP problems in metabolic networks is 
the number of reactions/genes whose on or off switching as well as prediction 
of over- or imder*-expression requires binary 0-1 variables to describe. These 
problems belong to the class of generalized network problems (Ahuja et al, 

5 1993) where each metabolite constitutes a node and each reaction represents 
an arc in the network. Given that existing FBA models for prokaryotes 
(Edwards and Palsson. 2000) contain hundreds of reactions and upcoming 
models for S. cerevisiae will likely be in the thousands motivates the need to 
harness complexity. In addition, the tremendous redxmdancy, redirection 

10 capabiHty and multiplicity of steady-state solutions further exasperates 

complesdty issues. In light of these challenges, some of the problems addressed 
by the present invention so £ar, particularly in the context of the minimal 
reaction sets required CPU's in the order of 50 hours. 

A number of preprocessing and reformulation techniques can be used 

15 according to the present invention to alleviate the computational burden. 
These techniques include isoenzjone grouping, futile cycle exclusion and 
network coimectiviiy constraints. Isoensyme grouping refers to the 
aggregation of reactions differing only in the cataljrzing enzyme (i.e., 
isoenzymes) in a single reaction. This reduces complexity by pruning the total 

20 number of binary variables. Futile cycle exclusion addresses the removal of 
sets of reactions (2 or more) which collectively recycle fluxes in a loop without 
any net efiect on metabolism or energy generation. In general, a set 
composed of K reactions forms a futile cycle if 

25 The following constraint: 

inactivates at least one reaction breaking the cycle. 

Connectivity constraints will ensure that if a reaction producing an 
intracellular metabolite is active, then at least one reaction consuming this 
30 metaboUte must be active and vice versa. In addition, if a reaction 
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transporting an extracellular metabolite into the cell is active, then at least 
one intracellular reaction consuming this metabolite must be active and vice 
versa. 

State of the art commercial MILP solvers such as CPLEX6.1 and OSL 
5 which run on a multiprocessor unix platform IBM RS6000-270 workstation can 
be used to solve these lypes of problems. For problem sizes that are 
intractable with commercial MILP solvers, customized decomposition 
approaches can be used. For example, Lagrangean relaxanon and/or 
decomposition by partitioning the original metabolic network into subnetworks 
10 loosely iaterconnected with only a handful of metaboUtes can be used. By 
iteratively solving many smaller problems instead of one large one 
computational savings are expected. Further, the present invention 
contemplates the use of diqunctive programming approaches which combine 
Boolean with continuous variables. These methods have been shown to be 
15 particularly efiEsctive for MELP problems where all the 0-1 (i.e.. Boolean) 

variables are aggregated into logic constraints as is the case with many of the 
MHP formulations of the present invention. 

8. EXAMPLE: PROBING THE PERFORMANCE IJMITS OF 
20 ESCHERICHIA COU METABOLIC NETWORK SUBJECT TO GENE 
ADDITIONS OR DELETIONS 

The framework of the present invention can be applied to a number of 
metabolic network problems in a number of different contexts. The present 
invention has been used to probe the performance limits of the E. coli 
25 metabolic network subject to gene additions or deletions. According to this 
example, an optimization-based procedure for studying the response of 
metabolic networks after gene knockouts or additions is introduced and 
applied to a linear flux balance analysis (FBA) E. coli model. Both the gene 
addition problem of optimally selecting which foreign genes to recombine into 
30 E, coli, as well as the gene deletion problem of removing a given number of 
existing ones, are formulated as mixed-integer optimization problems using 
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binary 0-1 variables. The developed modeling and optimization framework is 
tested by investigating the effect of gene deletions on biomass production and 
addressing the maximum theoretical production of the twenty amino adds for 
aerobic growth on glucose and acetate substrates. In the gene deletion study, 
5 the smallest gene set necessary to achieve maximum biomass production in E. 
coK is determined for aerobic growth on glucose. The subsequent gene 
knockout analysis indicates that biomass production decreases monotonically 
rendering the metabolic network incapable of growth after only 18 gene 
deletions. 

10 In the gene addition study, the E. coli flux balance model is augmented 

with 3,400 non-E. coli reactions from the KEGG database to form a multi- 
^edes model. This model is referred to as the Universal model This study 
reveals that the maximum tiieoretical production of six amino acids coxild be 
improved by the addition of only one or two genes to the native amino add 

15 production pathway of E. coK, even though the model could choose from 3,400 
foreign reaction candidates. Specifically, manipulation of the arginine 
production pathway showed the most promise with 8.75% and 9.05% predicted 
increases with the addition of genes for growth on glucose and acetate, 
respectively. The mechanism of all suggested enhancements is either by: (i) 

20 improving the energy effidency and/or (ii) increasing the carbon conversion 
ef&dency of the production route. 

This example according to the framework of the present invention uses 
flux balance analysis and mixed-integer programming tools to select the 
mathematically optimal genes for recombination into E. coli from a metabolic 

25 database encompassing many genes from mtdtiple spedes. The resulting 
pathways need not lie directly on main production pathways, as they may 
enhance production indirectly by either redirecting metabolic fluxes into the 
production pathways or by increasing the energy ef&dency of the present 
pathways. 

30 The recent upsurge of sequenced genomes has also brought significant 

attention to the question of which genes are crucial for supporting cellular life. 

25 
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Flux balance analysis modeling provides a useftd tool to help elucidate this 
question. Although FBA models cannot simulate the regulatory structxure 
alterations associated with gene deletions, these models can capture whether 
sufficient network connectivity exists to produce metabolites critical to cellular 

5 survival. In fact, a recent FBA model proposed by Edwards & Palsson (2000) 
was able to qualitatively predict the growth patterns of 86% of the mutant E. 
coli strains examined. This model was also used to identify some of the 
essential gene products of central metabolism for aerobic and anaerobic E. coli 
growth on glucose. J. S. Edwards and B. O. Palsson, BMC Bioinformatics 

10 2000b 1, 1. 

Determining the maximum number of tolerable gene deletions in a 
given metabolic system, however, requires a discrete optimization strategy in 
which multiple gene deletions can be simultaneously examined A related 
approach utilizing discrete optimization to identify all alternate optima in 

15 linear metabolic models has been proposed by Lee et al. (2000). 

Acoordix^ to the present invention, we examine how stoichiometnc 
boundaries of cellular performance expand or contract in the presence of 
multiple gene additions or deletions. A FBA model of the cellular metabolism 
of EL coli is constructed incorporating the reaction pathways provided by 

20 Pramanik and Keasling (1997) along with modifications suggested by Karp 
(1999) based on more recent data. The modifications are either small molecule 
corrections based on more recent metabolic information or the removal of 
certain pathways now known to be absent from the £. coli genotype. A 
stoichiometric matrix as suggested by Schilling containing aU metabolic 

25 reactions from the Kyoto Encyclopedia of Genes and Glenomes is compiled and 
incorporated into the model C. H. Schilling, et al., Biotech. Prog. 15. 288 
(1999). We refer to this multi-species stoichiometric matrix as the Universal 
stoichiometric matrix. A short discussion of flux balance analysis will be 
presented next, followed by the gene addition and deletion formulations and 

30 their appHcation to biomass and amino acid production in E, coli. 
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8.1 Flux Balance Analysis 

Flux balance analysis (FBA) requires only the stoichiometry of 
biochemical pathways and cellular composition information to identify 
boimdaries for the flux distributions available to the cell. Although 

5 microorganisms have evolved highly complex control structures which ^ 
eventually collapse these available boundaries into single points, FBA models 
are still valuable in setting upper bounds for performance targets and in 
identifying ^'ideal" flux distributions. The underlying principle of FBA is mass 
balances on the metabolites of interest For a metabolic network comprised of 

10 N metabolites and M metabolic reactions we have, 

EVy^^* (7) 

where Sij is the stoichiometric coefGicient of metabolite i in reaction j, vj 
represents the flux of reaction and b; quantifies the network's uptake (if 
negative) or secretion (if positive) of metabolite i. For all internal metabolites, 

15 hi is zero. Reversible reactions are defined simply as two irreversible reactions 
in opposite directions, constraining all fluxes to positive values. 

Typically, the resulting flux balance system of equations is 
underdetermined as the number of reactions exceeds the number of 
metabolites and additional information is required to solve for the reaction 

20 fluxes. Several researchers have measured external fluxes to add as 

constraints to their under-determined models, rendering them completely 
determined or over-determined. H. Jorgensen, et a/., Biotechnol. Bioeng. 46 
117 (1995); E. Papoutsakis and C. Meyer, Biotechnol. Bioeng. 27, 50 (1985); E. 
Papoutsakis and C. Meyer, Biotechnol Bioeng. 27, 67 (1986); A. Pons, et al. 

25 BiotechnoL Bioeng. 51(2), 177 (1996). However, additional assumptions such 
as removing reaction pathways are often needed before external flux 
measurements can completely define a system, and neglecting potentially 
active pathways to render a system completely defined may cause large 
changes in calculated fluxes (Pramanik, 1997). A popular technique for 

30 investigating metabolic flux distributions is linear optimization (Varma, 1994). 
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The key conjecture is that the cell is capable of sparmiag all flux combinations 
allowable by the stoichiometric constraints and thus achieving any flux 
distributions that maximize a given metaboUc objective (e.g., biomass 
production). The linear programming model for maximizing biomass 
5 production is: 

Maximize Z = v^^^^ 

b,e% Vi 

10 v^e9l*, V y 



where Vbiomass IS a flux drain comprised of all necessary components of biomass 
in their appropriate biological ratios. Other objective functions such as 
15 maximizing metabolite production, maximizing biomass production for a given 
metabolite production, and minimizing ATP production have also been 
investigated in the pridi* art. 

8.2 Escherichia coli stoichiometric models 

20 Microbial stoichiometric models incorporate collections of reactions 

known to occur in the studied species for simulating metaboUsm. The 
complete sequencing of the E, coli genome makes it a model organism for the 
study presented in this paper because extensive knowledge regarding its 
biochemical pathways is readily available. Varma and Palsson proposed the 

25 first detailed FBA E. coli model capable of predicting experimental 

observations. A. Varma and B. O. Palsson, J. Theor. Biol. 165, 503 (1993). 
The stoichiometric matrix included 95 reversible reactions utilizing 107 
metabolites for simulating glucose catabolism and macromolecule biosynthesis. 
This model was used to investigate byproduct secretion of £. coli at 
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increasingly anaerobic conditions and was able to predict the right sequence of 
byproduct secretion consistent with experimental findings: first acetate at 
slightly anaerobic conditions, then formate, and finally ethanol at highly 
anaerobic conditions. A. Varma, et al, AppL Environ. Microb. 59, 2465 (1993). 
5 Building on the previous model, Pramanik and Keasling (1997) introdueed a 
model that incorporated 126 reversible reactions (including 12 reversible 
transport reactions) and 174 irreversible reactions, as well as 289 metabolites. 
Pramanik and Keasling (1997) correlated the macromolecule composition of jB. 
coli as a fimction of growth rate, and verified their model with experimental 

10 data. The model successfully predicted several levels of genetic control such as 
the glycosylate shunt closing for growth on glucose and the PEP carbo:^kinase 
flux tending towards oxaloacetate. Furthermore, the glyooxylate shunt was 
active during growth on acetate while the flux through PEP carboxykinase was 
toward Phosphoenolpyruvate. 

15 The stoichiometric E. coli model used in this study employs 178 

irreversible, 111 reversible and 12 transport reactions compiled largely firom 
the model published by Pramanik and Keasling (1997). The modifications to 
the Pramanik and Keasling stoichiometric matrix are given in the table of 
Figure 12, They are primarily small molecule corrections (e.g., ATP in place of 

20 GTP for succinate thiokinase) or the removal of reactions now known to be 
absent from E. coli based on more recent data (Karp, 1999). Note that similar 
changes were also independently included in the most recently published E, 
coli model of Edwards and Palsson (2000). The metabolic network is fueled by 
transport reactions allowing an unconstrained supply of ammonia, hydrogen 

25 sulfate, and phosphate, along with a constrained supply of glucose or acetate to 
enter the system. Oxygen uptake is unconstrained to simidate aerobic 
conditions. Unconstrained secretion routes for lactate, formate, ethanol, 
glyceraldehyde, succinate, and carbon dioxide byproducts are provided by the 
transport reaction fluxes. The Universal model is constructed by incorporating 

30 3400 cellular reactions firom the Kyoto Encyclopedia of Genes and Genomes 
into the modified Keasling stoichiometric modeL The Universal stoichiometric 
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matrix contains all reactions known to occur in E. coli, as well as a number of 
reactions from other organisms. 

8.3 Mathematical Modeling of Gene Deletions/Additions 

5 Practically eveiy metaboUc reaction is regulated to some extent by one 

or more enzymes, produced by the translation of one or more genes. As a 
restilt, the removal of certain genes from microbial DNA sequences can be fatal 
or have Uttle if any effect depending upon the role of the enzymes coded for by 
these genes. Conversely, the addition of certain genes through recombinant 

10 DNA technology can have either no effect or produce novel desirable cellular 
functionaUties. Given a stoichiometric model ofE, coli metabolism and the 
Universal stoichiometric matrix encompassing reactions occurring in multiple 
species, the goal of this section is to formulate a mathematical model that (i) 
captures cellular robustness in the presence of multiple gene deletions, and (ii) 

15 identifies additional genes from the Universal data set having the most 
profound effect on improving a given metabolic objective. 

First, define E = {X&} = as the set of all possible genes where 

M represents the ntunber of E. coli genes and Trepresents the total number of 
genes in the data set. This set can be partitioned into two subsets E and NE 

20 where subset E represents genes present in E. coli and subset NE represents 
genes present only in non-K coli species: 

«S = {ifclAf+l^ifc^r} 
Subsequently, let binary variable yk describe the presence or absence of 
25 each gene k: 

{0 if gene k is not expressed in host oiganism 
1 if gene k is present and functional 

The selection of the optimal gene choices for deletion or insertion from 
DNA recombination can be determined by appropriately constraining the 
nxunber of non-zero elements in y . The case of removing a given number of 
30 genes, d, from E. coli can be investigated by including the xbllowing constraint: 

30 
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This ensures that no more than (M- d) genes are available to the 
metabolic network. Similarly, the effect of introducing any number of 
additional genes, h, can be investigated by utilizing: 
5 y,=), Vfce5 

(8) 
(9) 

Equation (8) allows all E. coli genes to be present and functional if necessary, 

10 while equation (9) sets an upper limit to the number of allowable additions. 

The optimal genes selected by the model are obtained by determining which 

elements otNE are equal to one. In addition, since multiple genes offcen 

correspond to a single reaction and occasionally multiple reactions are 

catalyzed by an enzjrme coded for by a single gene, the binary parameter ajk is 

15 defined to describe which, enzymes are coded for by which genes: 

if gene k has no direct effect on leaction j 
' gme k codes for an enzymecatalyzingieaction j 

Parameter sjk establishes links between genetic functional assignments and 
reactions. In order for a flux vj to take on a non-zero value, at least one gene 
must code for an enzyme catalyzing this reaction {c^ = 1) and this gene must 
20 be present and functional in the host organism (yk = 1). Given that at least one 
gene must code for every enzyme we have, 

_i =0 if no g^e coding for the enzyme of reaction j is functional 

Zj ^yk if at least one gene coding for the en2^e of reaction 7 is functional 

This impUes that the following constraint, 



fO if 

"'ii if, 
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ensures that 1;;= 0 if there exists no active gene k capable of supporting 
reaction j. In this case, ^^^yt = 0, which in turn forces the value of vj to zero. 

k 

Alternatively, if at least one such gene is functional, then ^^^yk - 1* allowing 

k 

Vj to assume any value between a lower Lj and an upper Uj bound. These 
5 hoimds are set by niininni7:in g/m5iYimiy.ingr respectively the given flux vj subject 
to the stoichiometric constraints. These problems are solved using CPLEX 6.6 
accessed via the commercial software package GAMS. Problems with up to 
3700 binary variables were solved on an IBM RS6000-270 workstation. 

10 8.4 Gene Knockout Study 

In this ^cample according to the presentation, we determine what is the 
smallest gene set capable of maximizing biomass production on glucose 
substrate (uptake basis: 10 mmol ) and what is the maximum number of gene 
deletions from this gene set that still maintains a specified level of biomass 

15 production. First, we maximized the biomass production flux, v^-^ . The 

solution yields the maximum theoretical level of biomass production (v^^ = 
1.26 g biomasa/gDW-hr) achievable by the metabolic network within the 
stoichiometric constraints. Next, the minimum number of genes that 
maintains a specified target level of biomass production y^^^ (as a percentage 

20 of the maximum) is determined. The new objective function minimizes the 
total number of functional E. coli genes available to the cell subject to the 
constraint of setting biomass production v^^^^^ greater than or equal to VfJ^^ . 
This problem is formulated as: 

Minimize Z = ^yt 

25 

M 

subject to ^SgVj=bf, Vi 
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'^bioauas 

6, g5R, Vi 

5 >;,eM>V*e£ 

where the nonzero elements of define the minlmuni gene set capable of 
attaining the target growth rate. The smallest gene set Miooh , capable of 
sxistaining the maximum theoretical growth rate is obtained by setting 
10 v^iwr =100% '^wLif • Th® model predicts that 202 non-transport intracellular 
reactions out of 400 available reactions (111x2 reversible reactions + 178 
irreversible reactions) are required to sustain . These reactions include 
the glycolytic reactions, the pentose phosphate pathway, the TCA cyde, the 
respiratory reactions and all other anabolic and catabolic routes necessary for 
15 optimal growth. 

Given Mioo% , the next goal is to determine which of these genes could be 
knocked-out while still allowing the metaboUc network to sustain specified 
sub-optimal growth rates. This is accomplished by setting equal to 
various percentages of v",„ and constraining the intracellular reaction fluxes 
20 outside of Mioo% to zero. It must be noted that this assumption prevents the 
model &om activating any genes outside of the Mioo% set and the significance 
of this assumption will be discxissed in the following section. The number of 
allowable gene knockouts for various biomass production levels are given in 
Figure 13 while the selected gene removals are presented in the table of Pigxire 
25 14. As expected, as the biomass production demands on the network are 
lessened, the model tolerates more gene knockouts. However, the range of 
allowable knockouts is rather small. SpedficaUy, the model tolerates at most 
9 gene deletions with a biomass requirement oiQOVo^v^^ , while 18 gene 
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removals render the network incapable of biomass formation. Thus the subset 
containing all elements of Mioo% minus the 18 gene knockouts (194 genes) 
describes the smallest subset of Mioox capable of suistaining E. coli cellular 
growth for the employed FBA model. Additionally, it must be noted that all 
s subsets include the seven experimentally verified essential gene products of 
central metabohsm identified by the in silico gene deletion study ofE. coli 
conducted by Edwards and Palsson (2000b). 

8.5 Discuission on the Gene Deletion Study 

10 Investigation of the specific gene knockouts provides interesting insight 

into the effect of various energy generation pathways. The suggested gene 
deletions imply that the energetic status of the network is improved as the 
required biomass production demands on the ceU are reduced. This is 
demonstrated by the fact that as the biomass requirements are lessened, the 

15 optimization formulation sequentially eliminates pathways responsible for the 
formation of energy. One such observation involves the gradual degradation of 
the TCA cycle. When the model is constrained to produce only 80% of the 
optimal level of biomass, the network no longer utilizes the succinate 
dehydrogenase enzyme to produce FADHz. Further reducing the biomass 

20 production requirement to 70% enables the removal of the fiunAB, mdh, and 
sucCD genes forgoing the formation of one GTP and one NADH per unit 
reaction flux. The next mqjor energy formation pathway to be eliminated 
occurs at a biomass production level of 20%. At this point, the energetic state 
of the cell is such that it no longer requires the formation of ATP from the 

25 cellular proton gradient. Finally, at the lowest biomass production levels, the 
cell no longer requires the oxidation of NADH to force protons across the 
cellular membrane. 

This study provides insight into the dependence of cellular growth on 
various energy generation pathways and provides an estimate of the miTiimiiTn 

30 number of metabolic genes capable of enabling celltdar growth. The prediction 
of 194 genes is lower than the theoretical estimation of 256 by Mushegian and 
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Koonin (1996) obtained by investigating the complete genomes of Haemophilus 
influenzae and Mycoplasma genitalium and assuming genes preserved across 
large phylogenetic distances are most likely essential This was expected 
considering the inability of this reaction-based framework to acoonnt for genes 
5 associated with translation, transcription, replication, and repair, and the 
lumping of pathways by the stoichiometric model. A more practical 
comparison involves considenng the number of metabolic genes included in the 
minimal gene set estimation. In this case, the predicted set of 194 metaboKc 
genes overestimates the 94 metabolic genes included in the minimal gene set 

10 proposed by Mushegian and Koonin (1996). This overestimation arises in part 
because the effect of activating metabolic genes outside of the original optimal 
gene set was not investigated. This lowers the minimal gene set estimation by 
openio^ additional metabolic routes. Furthermore, tiiis study only allowed 
glucose to enter the network as organic fuel and limited metabohc capacity can 

15 be compensated for by a proportionately greater dependence on the 
importation of nucleosides, amino adds, and other metahoUtes. CA 
Hutchison, et al., Science 286, 2166 (1999). 
8.6 Amino Acid Production Optimization Studies 

In this section, we identify mathematically optimal reaction pathways to 

20 recombine into the E, coli metabolic network to optimize amino add formation 
for growth on glucose and acetate. We explored the theoretically optimal 
formation of all twenly amino adds. Each optimization run was performed for 
two cases: ^ induding only the reactions present in E. coU, and (d) allowing 
the model to select all reactions from the Universal stoichiometric matrix. The 

25 problem of maximizing the amino add production is formulated by 

substituting amino add accumulation, , in place of v^;^ in equation (7), 
while the problem of maximizing the amino add formation 6^ of the 
Universal network is formulated as: 

Maximize Z^b^ 
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subject to ^SgVj = 6. , V i 



Note that this formulatioii allows the selection of any ntunber of 
reactions from the midti-^edes reaction hst. Reactions chosen by the model 

10 but absent in E. coli (i.e., aU non-zero yk elements of NE) provide routes for 
manipulating the cellular metabolism through recombinant DNA technology. 
The theoretical amino acid production capabilities of the E, coli metaboUc 
network, with and without the additional reactions from the Universal matrix, 
are shown in the table of Figure 9 for growth on glucose and acetate. It must 

15 be noted that it is the structural pathway changes predicted by the model that 
axe more meaningfid than the exact numerical values because these are 
theoretical maximum yield calculations. Predictions by the Varma and 
Palsson (1993) model are shown br comparison. As expected, the maximum 
production capabilities by the Varma and Palsson (1993) model are slightly 

20 below the predictions of the more complex employed model due to the 
additional metabohc routes available for production. 

The results show that improvements to seven amino acid production 
pathways ofE. coli are theoretically attainable with the addition of genes from 
various organisms. Manipulation of the arginine pathway shows the most 

25 promise, with 8.75% and 9.05% increases with additional genes for growth on 
glucose and acetate, respectively. The optimal recombinant asparagine 
pathway shows 5.77% and 5.45% increases over current E. coli growth on 
glucose and acetate, while cysteine production can be raised 3.67% and 3.80%, 
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respectively. The histidiiie production pathway is revealed as another 
encouraging target for DNA recombination with 0.23% and 4.53% 
improvements available as well. The isoleucine, methionine, and tryptophan 
formation pathways offer the final three genetic objectives for enhancing 
5 production. 

The enzymes responsible for introducing these various improvements to 
the E, coli amino acid production pathways are shown in the table of Figure 
15. In most cases, the addition of only one or two genes to the original amino 
acid production pathway results in an increased maximum theoretical yield 

10 even though the complete list of 3,400 reactions was available for selection. 
For example, introducing foreign genes coding £3r carbamate kinase and the 
pyrophosphate dependant version of 6-phospho&uctokinase further optimizes 
arginine production for growth on glucose, while adding carbamate kinase and 
another gene codii^ for acetate kinase renders the arginine production 

IS pathway on acetate stoichiometrically optimal. Expressing the genes coding 
for aspartate-ammonia hgase and sulfate adenylyltransferase in E. coli results 
in the increased mentioned earlier in asparagine and cysteine productions, 
respectively. Only the production of isoleucine on glucose and acetate 
substrates and the production of methionine on acetate require over two 

20 additional enzymes to reach optimality according to the modeL 

8.7 Discussion on the Gene Addition Study 

Careful examination of these amino add pathways reveals how these 
additional enssymes improve the energetic efficiency of the original routes. The 

25 original and Universal arginine production pathways for growth on glucose are 
shown in Figure 16. The two pathways differ in only two reactions — the 
pyrophosphate dependant analog of 6-phospho&uctokinase in the Universal 
model replaces the ATP dependent version present in E. coli, and carbamate 
kinase in the Universal model replaces carbamoyl phosphate synthetase from 

30 the original E, coli model. The first improvement to energy utilization occurs 
because the Universal model 6-phosphofi:uctokinase uses pyrophosphate 
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formed &om Axgininosuccinate sjmthase reaction instead of ATP to transfer a 
phosphate group to fructose-6-phosphate in the third step in glycolysis. The E, 
coli model, which sends this pyrophosphate through pyrophosphatase for 
hydrolytic cleavage, in effect wastes the energy from this energy-rich 

5 phosphoanhydride bond. By recapturing this otherwise wasted energy, the 
pyrophosphate version of 6-phosphofi:uctokinase requires one less ATP 
phosphoanhydride bond per arginine molecxile produced. 

The second form of cellular energy savings is realized by the 
replacement of carbamoyl phosphate synthetase. The native carbamoyl 

10 phosphate synthetase creates one mole of carbamoyl phosphate from carbon 
dioxide at the expense of two ATP phosphoanhydride bonds. This reaction also 
requires an amino group of one glutamine molecule, which subsequently forms 
glutamate. Reforming glutamine from glutamate requires yet another ATP; 
thus each imit flux through carbamoyl phosphate synthetase requires three 

15 ATP. Carbamate kinase, incorporated in the Universal model, forms 

carbamoyl phosphate from carbon dioxide and ammonia at the expense of only 
one ATP. Therefore, carbamate kinase requires two less ATP bonds per imit 
flux of carbamoyl phosphate formed. Overall, the additional genes used by the 
Universal model save the origmal pathway three net ATP bonds increasing 

20 arginine production by 8.75%. A siinilar analysis can be performed on native 
and Universal arginine production routes from acetate substrate depicted in 
Figure 17* 

The E. coli asparagine production pathway is shown in Figure 18 for two 
modes of glucose entry into the metaboUc network — glucokinase and the 

25 phosphotransferase system. Interestingly, the E. coli model prefers 

glucokinase to the more common phosphotransferase system for glucose entry 
during optimal asparagine production. Although glucokinase is known to play 
a minor role in glucose metabolism tmder normal conditions, replacement of 
the phosphotransferase system by this reaction increases asparginine 

30 production from 1.560 mol/mol glucose to 1.818 mol/mol glucose. Glucose 
entry via the phosphotransferase system requires substantial flux through 
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phosphoenolpyruvate (PEP) synthase to regenerate PEP from pyruvate 
carrying the net expense of one ADF phosphoanhydride bond. Thus either 
over-expressing glucokinase in E. coli or adding a more active recombinant 
glucokinase enzyme may improve asparagine production. Figure 19 illustrates 

5 the optimal Universal route for asparagine production on glucose. By cEooBrQg 
the ADP-forming aspartate-ammonia Kgase en2yme over the AMP-forming 
version present in E. coli, the energy eflELciency of this pathway is improved. 
Presently no pathways for the conservation of the pyrophosphate bond energy 
have been identified in E. coli, thus the formation of AMP uses the equivalent 

10 of two ATP phosphoanhydride bonds. In contrast, by forming ADP, the 

Universal pathway requires the breakage of only one phosphoanhydride bond 
per unit flux. In fact, the energy efiELdency of the Universal model is such that 
the formation of asparagine does not require ATP formation from the traos- 
membrane proton gradient. This gradient is used solely to transport inorganic 

15 phosphate into the cell. This mechanism improves asparagine production 
5,77% for growth on glucose and 5.45% for growth on acetate. 

The optimal histidine production pathways of the E. coli and Universal 
models for growth on acetate are shown in Figure 20. Again, the Universal 
model selects a reaction to conserve the phosphoanhydride bond energy of 

20 pyrophosphate generated in this case by both ATP phosphoribosyltransferase 
and phosphoribosyl-ATP pyrophosphatase. Thus the Universal model is at 
least 2 ATP more efGLdent than the E. coU model per histidine molecule 
produced. In addition, the addition of gjLydne dehydrogenase to the E. coli 
model improves the carbon conversion of the native histidine pathway. Under 

25 optimal histidiue production conditions in native E. coli, intracellular glycine 
is converted to carbon dioxide and ammonia by the glycine cleavage system. In 
this process, only one of glycine's carbons is conserved by its transfer to 
tetrahydrofolate. The Universal model, on the other hand, conserves both 
carbons by converting glycine to glyoxylate which subsequently is pumped 

30 back into the glyoiQrlate shunt. Both mechanisms improve the maximum 
theoretical yield of histidine 4.53%. 



wo 02/055995 



PCTAJS02/00660 



8.8 Conclusions 

The proposed optimization framework provided the quantitative means 
to study metabolic network performance in response to gene deletions or 
additions. Metabolic network performance relates to either robustness in the 

5 face of gene deletions or flux enhancements through foreign gene 

recombination from an ever-expanding database of available genes. Although 
complete gene-enzjnne relationships are not currently available, the 
formulation enables the incorporation of this information as it becomes 
available. The gene knockout analysis revealed that the E. coli metabohc 

10 network optimized for growth could endure an increasing amount of gene 
knockouts as its growth demands are lowered. Furthermore, the network 
ooidd theoretically tolerate at most 18 gene deletions before biomass 
production is no longer possible. The gene addition studies revealed that 
adding additional options to the coli genotype by DNA recombination 

IS provided improvements to the maximum theoretical productions of seven 
amino adds. These improvements occur by one of two mechanisms: (i) by 
improving the energy e££ciency or (ii) by increasing the carbon conversion 
ef&ciency of the production route. 

The reliance of flux balance analysis strictly on stoidiiometric 

20 characteristics is its greatest strength but also can be its most prominent 
weakness. The flux distributions within the cell are ultimately uniquely 
determined by the regulatory mechanisms within the cell, the kinetic 
characteristics of cellular enzymes, and the expression of these en2ymes. 
Assuming celts operate in a stoichiometricaUy optimal fiashion jdelds a wider 

25 boundary of metabolic flux distributions than may be available to the cell. 
Currently we are incorporating regulatory information into flux balance 
models with the use of logic constraints. These constraints will ensure that up 
or down movements in metabolite concentrations are consistent with up or 
down shifts in reaction flux values. A more tightly constrained model will give 

30 additional insight on how overproducing ceUular products affects overall 
metabolic regulation. As the accuracy of metabohc models improves and the 
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amount of information available for flux balance analysis grows, tbe 
framework introduced in this paper can be used to select the most optimal 
gene addition and/or deletion metabolic manipulations to perform. 

5 9. EXAMPLE: MINIMAL REACTION SETS FOR ESCHERIGinA COLI 
METABOLISM UNDER DIFFERENT GROWTH REQUIREMENTS AND 
UPTAKE ENVIRONMENTS 

The framework of the present invention can be applied to a number of 
metabolic network problems in a number of different contexts. The framework 

10 of the present invention has also been applied to determicdng minimal reaction 
sets for E. Goli metabolism under different growth requirements and uptake 
environments. According to the present invention^ a computational procedure 
for identii^^g the mmimaT set of metabolic reactions capable of supporting 
various growth rates on different substrates is introduced and applied to a flux 

15 balance model of the E. coli metabolic network. This task is posed 

mathematically as a generalized network optimization problem. The minimal 
reaction sets capable of supporting specified growth rates are determined for - 
two different uptake conditions (i) limiting the uptake of organic material to a 
single organic component (e.g., glucose or acetate) and (d) allowing the 

20 importationofaoymetaboUte with available oeUular transport readi We 
find that minimal reaction network sets are hie^bly dependent on the uptake 
environment and the growth requirements imposed on the network. 
Specifically, we predict that the E. coli network, as described by the flux 
balance model, requires 224 metabolic reactions to support grovTth on a 

25 glucose-only medium and 229 for an acetate-only medium, while only 122 
reactions enable growth on a specially engineered growth medium. 

The recent explosion of fully sequenced genomes has brought significant 
attention to the question of how many genes are necessary for sustaining 
ceUular Ufe. A minimal genome is generally defined as the smallest set of 

30 genes that allows for replication and growth m a particular environment. 
Attempts to uncover this minimal gene set include both experimental and 
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theoretical approaches. Global transposon mutagenesis was used by 
Hutchison et oZ. (1999) to determine that 265 to 350 of the 480 protein-coding 
genes ot Mycoplasma genitalium^ the smallest known cellular genome (580 kb), 
are essential for survival under laboratory growth conditions. Additional 

5 experimental work revealed that only 12% and 9% respectively of the yeast 
and Bacillus subtilis genomes are essential for ceUuIar growth and repHcation. 
M. G. Goebl and T. D. Petes, CeU 46, 983 (1986); M. Itaya, FEBS Lett 362, 257 
(1995). Theoretical methods stem from the assiunption that genes conserved 
across large evolutionary boundaries are vital to cellular survival. Based on 

10 this hypothesis, a minimal set of 256 genes was compiled by Mushegian and 
Koonin (1996) by assuming genes common to M genitalium and Haemophilus 
influenzae must be members of a minimal genome. Interestingly, only 6 out of 
26 E. coli open reading frames of unknown function conserved in M 
genitalium were deemed essential to species survival (Arigoi, et aL 1998). The 

15 existence of multiple, quite different, species and environment specific minimal 
genomes has long been speculated (Huynen 2000). 

Here we describe a computational procedure for testing this claim by 
estimating the TnininniTn required growth-sustaining core of metabolic 
reactions imder different uptake conditions. The latest stoichiometric model of 

20 E. coli metabolism proposed by Palsson and coworkers (Edwards & Palsson 
2000b) is employed to identify the smallest set of en^rmatic reactions capable 
of supporting given targets on the growth rate for either a gLuoose, an acetate, 
or a complex substrate. This flux balance analysis (FBA) model incorporates 
454 metaboUtes and 720 reactions including the glycolysis, tricarboxyUc add 

25 (TCA) cycle, pentose phosphate pathway (PPP), and respiration pathways 
along with sjmthesis routes for the amino acids, nucleotides, and hpids. 
Growth is quantified by adding an additional reaction to the model simvdating 
a drain on the various components of E, coli biomass in their appropriate 
biological ratios. F. C. Neidhardt, Escherichia coli and Salmonella: Cellular 

30 and Molecular Bioloev. ASM Press ed. Washington, D.C., 1996. By associating 
a gene to each metaboUc reaction in the network, gene activations and 
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inactlvations are incorporated into the FBA model using logic 0-1 binary 
variables. The problem of minimizing the number of active metabolic reactions 
required to meet specific metabolic objectives (i.e., growth rates) is shown to 
assume the mathematical structure of a generalized network flow problem 

5 where nodes denote metabolites and connecting arcs represent reactions? 
Alternatively, instead of a biomass target, TniniTmini levels of ATP production 
or lowest allowable levels of key components/metabolites could readily be 
incorporated in the model. A mixed-integer linear programming (MILP) 
solver, CPLEX 6.5 accessed via GAMS, is employed to solve the resulting 

10 large-scale combinatorial problems with CPU times ranging from minutes to 
days. 

Based on the E, coli model, the miiumal reaction network is explored for 
different growth requirements under two contrastii^ uptake environments (i) 
restrictii^ the uptake of organic material to a single organic component and 

15 (ii) allowing the uptake of any organic metabolite with a corresponding 

transport reaction. These two extreme uptake scenarios were chosen to model 
maximum and Tnim>niini reliance on internal metabolism Sar component 
synthesis respectively, and probe their effect on the minimum reaction set 
required. Previous attempts utilized reductionist methodologies to extract the 

20 setofessentialgenes through a series of gene knockouts. Here we use an 
e£Gcient computational procedure for selecting the minimal set by 
simultaneously considering the effect of all reactions on cell growth. A 
minimal gene set is then be inferred by mapping the en^ymeCs) catalyzing 
these reactions to the corresponding coding genes. While the obtained residts 

25 are, in principle, dependent on the specifics of the employed flux balance E. 
coli model (Edwards & Palsson 2000), they still provide valuable insight and 
perspective to the questions of what is the minimal genome and how is it 
shaped by the environment. 

30 
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9.1 Results 

The first case study involves identifymg the iniiiimal reaction set 
supporting E. coli growth on a glucose substrate. A detailed description of the 
employed modeling procedure is provided in the appendix. A constrained 

5 amount of glucose (< 10 mmol/gDW hr), along with unconstrained uptake 
routes for inorganic phosphate, oxygen, sulfate, and ammonia are enabled to 
fuel the metabolic network. Secretion routes for every metabohte capable of 
exiting the cell are also provided. Under these conditions, the FBA model 
predicts that the E. coli reaction network is capable of achieving a maximum 

10 theoretical growth rate of 0.966 g biomass/gD W-hr, which we will refer to as 
the maximum growth rate (MGR). By requiring the reaction network to match 
the MGR we determined that at least 234 reactions out of 720 are required for 
maximum growth on glucose. 

The growth demands are then relaxed in subsequent studies to identii^ 

15 the minimal number of metabolic reactions required to meet various sub- 
maximal growth demands (% of MGR). Interestingly, the number of necessary 
metabolic reactions decreases only mildly with the falling growth demands 
imposed on the network as indicated by Figure 21. While a reaction set 
comprised of 234 reactions is needed for maximum growth, the minimal 

20 reaction set corresponding to growth rates of 30% and lower involves only 224 
reactions. The same minimal reaction set persists even for growth rates as low 
as 0.1% of the MGR. In general, the reaction set reductions are attained by 
successively eliminating energy producing reactions occurring in (i) glycolysis, 
(ii) the TCA cycle, and Qii) the pentose phosphate pathway as the growth 

25 demands are lessened^ However, certain reactions absent at higher growth 
rates enter the minimal sets at lower growth rates suggesting a much more 
complex mechanism of flux redirection than successive reaction elimination. A 
detailed description of the reactions entering/leaving the minimal reaction set 
as the imposed growth requirements are lowered is provided in the table of 

30 Figure 22. 
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For comparison, a similar study enabling a constrained amount of 
acetate (< 30 mmol/gDW-hr) to enter the network instead of glucose was 
performed (see Figure 21). Here the network is much less tolerant of reaction 
set reductions than in glucose study. While for a glucose substrate the 
minimal network sizes decrease from 234 to 224 reactions as the growtlr 
demands are lowered, for an acetate substrate the network sizes reduce only 
from 231 to 229 reactions. This implies that the minimal reaction set size is 
not only dependent on the imposed biomass production requirements, but also 
on the specific choice for the single substrate. 

It is important to note that neither the minimal reaction sets nor their 
corresponding reaction fluxes are unique. For example, for the 30% glucose 
uptake case we identified over 100 different minimal reaction sets containing 
exactly 224 enzymatic reactions without even counting the multiplicities 
associated with the 171 isoen2ymes present in the network. Among most of 
these multiple minimal reaction sets, the activity and fiux directions of the 
major pathways differ very Uttie. Most variations are concentrated on the 
catabolic parts of the networks. For instance, while some minimal reaction 
sets secrete carbon dioxide, acetate, and fiimarate as the only metabolic 
byproducts, other sets may also secrete varying amoimts of formate, glycerol, 
and the amino adds phenylalanine and tyrosine. These results provide a 
computational confirmation of the astounding redundancy and flux redirection 
versatility of the E. coU network. More importantiy, all minimal reactions sets 
identified include 11 of 12 reactions whose corresponding gene deletions were 
determined experimentally to be lethal for growth on glucose. Earlier analyses 
based on single gene deletions conducted with this model using linear 
optimization identified only 7 of 12 lethal gene deletions motivating the 
importance of considering simultaneous gene deletions within an MILP 
framework. 

In the second case study, the uptake or secretion of any organic 
metabolite is enabled. The amount of organic material entering the network is 
kept consistent with the first case study by allowing the uptake of a 
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constrained amount of carbon atoms (< 60 mmol/gDW hr). Unconstrained 
uptake routes for oxygen, inorganic phosphate, sulfate, and ammonia are also 
provided as in the first study. Under these "ideal" uptake conditions, we find 
that a maximum growth rate (MGR) of 1.341 g biomass/gDW hr is attainable 

5 requiring at least 201 metabolic reactions. The fact that only five amino acids 
are imported under maximum growth (i.e., MGR) conditions indicates that it is 
stoichiometricaUy more favorable to produce most amino acids internally 
rather than transport them into the cell from the medium. 

This trend, however, is quickly reversed as the growth rate requirement 

10 is reduced. This reversal yields a corresponding sharp decrease in the total 
number of required reactions as a direct result of the importation of an 
increasing number of metaboUtes at sub-maximum target growth' demands. 
The table of Figure 23 lists the metabolites uptaken or secreted at each target 
growth rate, while Figure 24 (100% - 90% of MGR) and Figure 25 (100% - 1% of 

15 MGR) illustrate the number of required metabolic reactions needed to attain 
various target growth demands. The rapid reduction in size of the minimal 
reaction sets by importing an increasing nxunber of metabolites as the biomass 
demands are lessened (see Figure 23) continues imtil the growtii demands are 
reduced to about 90% firom the MGR. Below this growth target (see Figure 25) 

20 additional but modest reductions are achieved primarily through flux 

redirections. Figure 26 summarizes the reactions which are being removed or 
added to the minimal reaction set as the growth target is successively lowered. 
The smallest mixiimal reaction network for the second case study, comprised of 
122 reactions, is reached when the targiet growth demands are lowered to 10% 

25 of the MGR. This minimal network is comprised mostly of cell envelope and 
membrane lipid biosynthetic reactions, along with a number of transport and 
salvage pathway reactions, as shown in Figure 27. As in the glucose-only 
study, multiple minimal reaction sets for multi>organic uptake case are 
expected. 
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